Developer CD Series 1996 May: Tool Chest


Text File | 1995-03-10 | 35.9 KB | 1,523 lines | [TEXT/MPS ]

PCCTS 1.20 --- Release Notes

Terence J. Parr
University of Minnesota
Army High-Performance Computing Research Center
Minneapolis, MN 55415
parrt@acm.org

Russell W. Quong
School of Electrical Engineering
Purdue University
W. Lafayette, IN 47907
quong@ecn.purdue.edu

William E. Cohen
School of Electrical Engineering
Purdue University
W. Lafayette, IN 47907
cohenw@ecn.purdue.edu

This document describes the 1.20 release of the Purdue Compiler Construction Tool Set (PCCTS). A number of new features have been added since 1.10 (August 1993), but the main addition is C++ support. The C++ support will be described in a forthcoming paper; it is merely summarized here.

PCCTS is in the public domain and can be obtained at marvin.ecn.purdue.edu in pub/pccts/1.20. You can join the pccts-users mailing list, which deals with the tools ANTLR, DLG (and SORCERER), by emailing pccts-users-request@ahpcrc.umn.edu with a body of

    subscribe pccts-users your-name-or-ret-addr

The authors make no claims that this software will do what you want, that this manual is any good, or that the software actually works---use PCCTS at your own risk. Bug reports and/or cheery reports of its usefulness are very welcome, however.

[This file was automatically converted from the LaTeX source; please see the PostScript version of this file where possible.]

Introduction

The 1.20 release primarily introduces C++ support for PCCTS, but it also includes some nice overall enhancements and important bug fixes. The C++ support is only of beta quality and, as such, can be expected to change in future releases. We anticipate more frequent releases of PCCTS in the future, rather than big twice-a-year releases, to provide faster bug fixes and feature enhancements.

The main 1.20 features and enhancements are:

  * Added ``#tokclass'' (define a set of tokens), ``.'' (wildcard operator), and ``~'' (not operator).
  * Added the ``..'' token range operator.
  * Added ``#tokdefs'' so programmers can have predefined token type values.
  * A new, improved genmk program.
  * Line numbers are now tracked when using infinite lookahead.
  * Added C++ support for ANTLR and DLG output.
  * Added the ``-o <dir>'' option to specify where all output should go.
  * Fixed a few nasty code generation bugs as well as a few bugs relating to semantic predicate hoisting.

Token Classes

A token class is a set of tokens that can be referenced as one entity; it is equivalent to a subrule consisting of its member tokens separated by ``|''s. The basic syntax is:

    #tokclass Tclass { T1 ... Tn }

where each Ti is a token reference (either a token label or a regular expression in double-quotes) or a token class reference; token classes may have overlapping tokens.

The difference between a token class and a subrule lies in efficiency. A reference to a token class is a simple set membership test during parser execution rather than a linear search of the tokens in a subrule. Furthermore, the code for a set membership test is much smaller than a series of if-statements in a recursive-descent parser. Note that automaton-based parsers (both LL and LR) automatically perform this type of set membership (specifically, a table lookup), but lack the flexibility of recursive-descent parsers such as those constructed by ANTLR.

A figure in the PostScript version shows a sample ANTLR 1.20 program that recognizes a simple mythical assembly language with statements such as:

    segment data
    a ds 42
    b ds 13
    segment code
    load r1, a
    load r2, b
    add r1,r2,r3
    print r3

Referencing token class REGISTER is the same as referencing

    ( "r0" | "r1" | "r2" | "r3" )

A wildcard token is also available; it refers to an implied token class consisting of all tokens referenced within a grammar. It can be used to ignore pieces of the input:

    ig : "beginignore" . "endignore" ;

which matches any single token between beginignore and endignore. The wildcard can also be used for error detection:

    if : "if" expr "then" stat
       | . <<fprintf(stderr, "malformed if-statement");>>
       ;

The programmer should be careful not to do things like this:

    ig : "beginignore" ( .
    )* "endignore" ;

because the loop generated for the ( . )* block will never terminate: "endignore" is also matched by the wildcard.

Rather than using the wildcard to match large token classes, it is often best to use the ~ operator. For example,

    ig : "beginignore" ( ~"endignore" )* "endignore" ;

where ~ is the not operator. The if example could be rewritten as:

    if : "if" expr "then" stat
       | ~"if" <<fprintf(stderr, "malformed if-statement");>>
       ;

The ~ operator may be applied to token class references and token references only (it may not be applied to subrules, for example). The wildcard operator and the ~ operator never result in a set containing the end-of-file token type.

One final token operator has been introduced: the range operator, of the form T1 .. T2. The value of T1 must be less than that of T2, and the values in between should be valid token types. In general, this feature should be used in conjunction with #tokdefs so that the programmer controls the token type values. An example range operator is:

    #tokdefs "mytokens.h"
    a : OpStart .. OpEnd operand ;

This feature is perhaps unneeded due to the more powerful token class directive.

New Directive #tokdefs

It is often the case that the user is interested in specifying the token types rather than having ANTLR generate its own; typically, this situation arises when the user wants to link an ANTLR-generated parser with a non-DLG-based scanner. To get ANTLR to use pre-assigned token types, specify

    #tokdefs "file"

before any token definitions, where file is a file containing only a list of #defines or an enum definition, with optional comments. When this directive is used, new token label definitions are not allowed (either explicit definitions like #token A or implicit definitions such as a reference to a token label in a rule). However, the programmer may attach regular expressions and lexical actions to the token labels defined in file.
For example, if mytokens.h contained:

    #define A 2

and t.g contained:

    #tokdefs "mytokens.h"
    #token A "blah"
    a : A B;

ANTLR would report the following error message, because B is referenced in a rule but is not defined in mytokens.h:

    Antlr parser generator   Version 1.20   1989-1994
    t.g, line 5: error: implicit token definition not allowed with #tokdefs

New genmk

The genmk program has been substantially upgraded. It now generates clean and scrub targets, generates more accurate file dependencies and, most importantly, generates makefiles for PCCTS C++ mode. The command line options are defined as follows:

    -CC
        Generate C++ output from both ANTLR and DLG.
    -class <name>
        Name of the grammar class defined in the grammar files. This is only a valid option if -CC was seen before it on the command line.
    -dlg-class <name>
        Name of the DLG lexer class (default is DLGLexer). This is only a valid option if -CC was seen before it on the command line. The option is placed on the DLG command line as the -cl option.
    -header <name>
        Name of the ANTLR standard header information (default=no file).
    -o <dir>
        Directory where output files should go (default="."). This is very nice for keeping the source directory clear of ANTLR and DLG spawn.
    -project <name>
        Name of the executable to create (default=t).
    -token-types <name>
        Token types are in this file; this option implies that the normal tokens.h is not to be generated or used for token type definitions (in C output mode, however, tokens.h may still be required because we have jammed some function prototypes in the file as well).
    -trees
        Generate ASTs; basically turns on the -gt option of ANTLR, but also results in a target to compile the AST support file (in C++ mode only).
    -user-lexer
        Do not create a DLG-based scanner. Turns on the ANTLR -gx command line option and turns off generation of targets for DLG scanners.
For example, to create a makefile for a grammar in test.g with project name t that uses a DLG-based scanner, use:

    genmk -project t test.g

To specify multiple files (comprising the same grammar) use:

    genmk -project t test.g test2.g test3.g

To create a basic C++ mode makefile for a grammar class Expr in file test.g and project t use:

    genmk -project t -CC -class Expr test.g

To rename the default DLGLexer class (and associated files) to Scanner for DLG, use:

    genmk -project t -CC -class Expr -dlg-class Scanner test.g

To create a makefile for a C++ parser that uses a non-DLG-based scanner, use:

    genmk -project t -CC -class Expr -user-lexer test.g

To create a makefile that uses a hand-built scanner for a grammar that uses ``#tokdefs "mytokens.h"'', use:

    genmk -project t -CC -class Expr -user-lexer -token-types mytokens.h test.g

Line Numbers and Infinite Lookahead

Because ANTLR-generated parsers that use infinite lookahead (i.e., syntactic predicates) scan the entire input stream before actual parsing begins, the normal line number variable zzline is meaningless: it always ``points'' to the last line in the file. To overcome this, we have added a new macro (member function infline(int i) in C++ mode):

    ZZINF_LINE(i)

where i is the lookahead depth. ZZINF_LINE(1) is the line number of the token about to be matched. The following example illustrates the use of the new macro:

    #header <<#include "charbuf.h">>

    <<
    main() { ANTLR(stat(), stdin); }
    >>

    #token "[\ \t]+"   <<zzskip();>>
    #token "\n"        <<zzskip(); zzline++;>>
    #token Equal "="
    #token Semi  ";"

    stat:   (list Equal)? list Equal list ";"  <<printf("list = list\n");>>
        |   list ";"                           <<printf("list\n");>>
        ;

    list:   "\(" elem ("," elem)* "\)" ;

    elem:   <<int line=0;>>
            <<line = ZZINF_LINE(1);>> WORD <<printf("WORD at line %d\n", line);>>
        |   <<line = ZZINF_LINE(1);>> INT  <<printf("INT at line %d\n", line);>>
        ;

    #token WORD "[a-zA-Z]+"
    #token INT  "[0-9]+"

The line number is recorded before the recognition of WORD and INT in elem because ZZINF_LINE(1) refers to the line number of the token about to be matched.
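The reason a single global line counter fails under infinite lookahead, and why a per-token line record fixes it, can be sketched outside PCCTS. The following is a minimal, hypothetical C++ illustration; the names LookaheadBuffer, lineOf, consume, and scannerLine are assumptions made for this sketch and are not part of the PCCTS support code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical buffered token: only the text and its line are kept.
struct BufferedToken {
    std::string text;
    int line;
};

// Illustrative stand-in for an infinite-lookahead buffer (not a PCCTS class).
class LookaheadBuffer {
public:
    // Scan the entire input before any parsing, recording each token's line.
    explicit LookaheadBuffer(const std::string& input) {
        std::string current;
        for (char c : input) {
            if (c == ' ' || c == '\t' || c == '\n') {
                flush(current);
                if (c == '\n') ++line_;
            } else {
                current += c;
            }
        }
        flush(current);
    }

    // Line of the i-th token of lookahead (i >= 1), or 0 past the end of input.
    int lineOf(std::size_t i) const {
        assert(i >= 1);
        std::size_t index = next_ + i - 1;
        return index < tokens_.size() ? tokens_[index].line : 0;
    }

    // Advance past the token about to be matched.
    void consume() { if (next_ < tokens_.size()) ++next_; }

    // Value of the scanner's global line counter once scanning has finished.
    int scannerLine() const { return line_; }

private:
    // Store the pending token text, if any, with the current line number.
    void flush(std::string& current) {
        if (!current.empty()) {
            tokens_.push_back(BufferedToken{current, line_});
            current.clear();
        }
    }

    std::vector<BufferedToken> tokens_;
    std::size_t next_ = 0;
    int line_ = 1;
};
```

In this sketch scannerLine() plays the role of zzline (already at the last line before parsing starts), while lineOf(1) plays the role of ZZINF_LINE(1).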
This feature does not work correctly in C++ mode when the -gk ANTLR option is used.

C++ Parser Organization

[The C++ output of PCCTS 1.20 is considered only ``beta'' quality. Expect a future release to require changes in most parsers written assuming version 1.20 C++ output.]

[Also note that the C++ output has only been tested under g++ 2.5.7 and cannot be guaranteed to compile under any other C++ compiler; indeed, g++ 2.4.5 will not compile our test cases.]

Computer language recognition is generally viewed conceptually as a parser that parses tokens taken from a token stream filled by a lexical analyzer (scanner) that takes characters from an input stream. In practice, the token and character streams end up being merely function calls to an input routine, while the separation between parser and scanner tends to become blurred. For the C++ output of ANTLR and DLG, we have chosen to create a class hierarchy that reflects the nice conceptual separation between recognition subtasks. A figure in the PostScript version represents the class interactions that mirror the standard parsing block diagram. In C++ code, the block diagram is ``constructed'' by:

  * Attaching an input stream to a DLG-based scanner:

        DLGFileInput in(stdin);  /* create an input stream for DLG to get chars from */
        DLGLexer scan(&in,2000); /* create scanner reading from stdin w/bufsize==2000 */

  * Providing a token to DLG for it to continually fill in (the scanner is totally separated from the parser and, hence, has no idea how big the programmer's token is). [Note that if C++ allowed typenames to be passed around as objects, the scanner could simply be initialized with the correct typename. Also, since constructors cannot be virtual (polymorphic), a DLG-based token must be able to answer a message called makeToken().]
        ANTLRToken aToken;
        scan.setToken(&aToken);

  * Attaching the token stream to a parser:

        Parser p(&scan);

To start parsing, it is sufficient to call the Parser member function associated with the starting grammar rule:

    p.startingrule(anyargs);

To specify the name of the parser class to construct, the programmer encloses all rules and actions in

    class Parser {
    ...
    }

Currently, exactly one class may be defined. For the defined class, ANTLR generates a derived class of ANTLRParser which accepts tokens (ANTLRToken) from a class derived from ANTLRTokenStream. This token stream can be either a user-defined scanner or a DLG-based scanner.

DLG generates a class (default name is DLGLexer) that is derived from DLGLexerBase. DLG-based scanners read input from derived classes of DLGInputStream, the most common being DLGFileInput.

Two figures in the PostScript version describe the files constructed from ANTLR and DLG in C++ mode in the case where ANTLR was given a grammar called file.g containing ``class Parser { ... }'' and DLG was given the usual parser.dlg. If the #tokdefs directive were used, the tokens.h file would not be generated by ANTLR, and if the ANTLR command line option for user-defined scanners (-gx) were used, parser.dlg would not be generated (#tokdefs and -gx operate independently of each other).

Files Parser.h and Parser.C represent the class definition and support code for the output parser. The actual recursive-descent parser generated from file.g is placed in file.C; there may be multiple input files and, for reasons of separate compilation, we do not put the parser functions into the Parser.C file. File tokens.h contains an enumerated type TokenType describing the set of defined token types. Files DLGLexer.C and DLGLexer.h embody the class definition of the scanner described by parser.dlg.

The C++ support code for ANTLR and DLG output is not dependent on #defines as the C output is. For example, ``#define DEMAND_LOOK'' is not generated by ANTLR in C++ mode.
The support code senses a flag in the parser structure that indicates whether to consume lookahead on demand or to continually fill the lookahead ``pipe.''

The communication medium between scanner and parser is an ANTLRToken (the class name is fixed), and the communications channel is an ANTLRTokenStream. Tokens are no longer simply a token type; further, attributes, as defined by ANTLR 1.10, are no longer defined. For version 1.20, we combine the previous definitions to create an abstract token, ANTLRToken under the ANTLRTokenBase hierarchy, that contains at least the token type, but may include anything else required by the user. When using DLG-based scanners, ANTLRToken must be derived from DLGBasedToken, which adds behavior needed by DLG (DLG-based scanners need to be able to set/get the text of a token). A common token definition, ANTLRCommonToken, is predefined to be the usual token type plus a fixed-size text buffer (similar to the old Attrib definition in charbuf.h). The minimal token definition is still the size of an integer (sizeof(enum TokenType)).

To access the tokens in a token stream from the parser, $i variables are used. However, these variables are consistently pointers to type ANTLRToken in C++ mode, whereas in C mode $i variables are of type Attrib.

The minimal programmer-supplied code requirements to get a C++ parser going are:

  * A function that constructs the various objects in the C++ parser block diagram (see the figure in the PostScript version) and invokes one of the parser member rules in your Parser class.

  * A definition for ANTLRToken. An easy way to accomplish this is to do the following:

        typedef ANTLRCommonToken ANTLRToken; /* a token is token type and text */

  * A class definition in the grammar file(s).

The various definitions need not be placed in the #header action as in C output mode. The next section provides a complete example that illustrates all of the conventions described in this section.
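The input stream, scanner, token stream, and parser arrangement described in this section can be sketched in a few lines of stand-alone C++. This is only an illustration of the layering under simplifying assumptions: StringInput, TokenStream, Scanner, SumParser, and parseSum are hypothetical names invented here, they do not use the PCCTS headers, and they return tokens by value rather than filling in a programmer-supplied token as DLG does.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

enum TokenType { Eof, Int, Plus };

// A token carries at least its type; here it also carries its text.
struct Token {
    TokenType type;
    std::string text;
};

// Character source; plays the role an input stream plays for the scanner.
class StringInput {
public:
    explicit StringInput(std::string s) : s_(std::move(s)) {}
    int nextChar() {
        return pos_ < s_.size() ? static_cast<unsigned char>(s_[pos_++]) : -1;
    }
private:
    std::string s_;
    std::size_t pos_ = 0;
};

// Abstract channel between scanner and parser.
class TokenStream {
public:
    virtual ~TokenStream() = default;
    virtual Token nextToken() = 0;
};

// A scanner is one kind of token stream; it pulls characters from its input.
class Scanner : public TokenStream {
public:
    explicit Scanner(StringInput& in) : in_(in), ch_(in.nextChar()) {}
    Token nextToken() override {
        while (ch_ == ' ') ch_ = in_.nextChar();
        if (ch_ == -1) return Token{Eof, ""};
        if (ch_ == '+') { ch_ = in_.nextChar(); return Token{Plus, "+"}; }
        if (std::isdigit(ch_)) {
            std::string digits;
            while (ch_ != -1 && std::isdigit(ch_)) {
                digits += static_cast<char>(ch_);
                ch_ = in_.nextChar();
            }
            return Token{Int, digits};
        }
        throw std::runtime_error("unexpected character");
    }
private:
    StringInput& in_;
    int ch_;
};

// The parser sees only the abstract token stream, never the scanner type.
class SumParser {
public:
    explicit SumParser(TokenStream& ts) : ts_(ts), la_(ts.nextToken()) {}
    // sum : Int ( Plus Int )* Eof ;
    int sum() {
        int total = matchInt();
        while (la_.type == Plus) { consume(); total += matchInt(); }
        if (la_.type != Eof) throw std::runtime_error("expected end of input");
        return total;
    }
private:
    void consume() { la_ = ts_.nextToken(); }
    int matchInt() {
        if (la_.type != Int) throw std::runtime_error("expected integer");
        int value = std::stoi(la_.text);
        consume();
        return value;
    }
    TokenStream& ts_;
    Token la_;   // one token of lookahead
};

// Build the block diagram: input stream -> scanner -> parser, then parse.
int parseSum(const std::string& text) {
    StringInput in(text);
    Scanner scan(in);
    SumParser p(scan);
    return p.sum();
}
```

Because SumParser depends only on TokenStream, a different token source can be swapped in without touching the parser, which is the separation the class hierarchy in this section is meant to preserve.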
A Simple C++ Example

A figure in the PostScript version gives a simple example of a DLG-based scanner with an ANTLR grammar class. Assuming that the example is in a file called test.g, an executable parser may be obtained via the following command sequence:

    antlr -CC test.g
    dlg -CC parser.dlg
    C++ -c -I/usr/local/pccts/include -o test.o test.C
    C++ -c -I/usr/local/pccts/include -o Expr.o Expr.C
    C++ -c -I/usr/local/pccts/include -o DLGLexer.o DLGLexer.C
    C++ -o t test.o Expr.o DLGLexer.o /usr/local/pccts/antlrx.o /usr/local/pccts/dlgx.o

where antlrx.o and dlgx.o are support code modules.

Grammar Classes

A grammar class is defined as follows:

    class Parser {
    ...
    }

The actions may contain any normal C++ code that is valid within a C++ class definition. [Actually, any normal ANTLR directive such as token definitions may be placed inside, but it is best to separate them from the grammar class.] For example,

    class Parser {
    <<public: int i;>>
    <<int f() { ; }>>
    rule : A B ;
    }

would result in a C++ class definition of:

    class Parser : public ANTLRParser {
    protected:
        static ANTLRChar *tokentbl[];
    private:
        static SetWordType setwd1[4];
    private:
    public:
        Parser(ANTLRTokenStream *lexer);
        Parser(ANTLRTokenStream *lexer, TokenType eof);
        void rule();
    public:
        int i;
        int f() { ; }
    };

ANTLR Parser Classes

ANTLR Tokens

In C++ output mode, a token is an object that represents an abstraction of a lexical object found on the input stream by the scanner. An ANTLR-generated parser accepts input as a sequence of tokens in the form of a token stream, where the token stream is usually managed by a lexical analyzer (typically DLG). A token contains, minimally, a token type that is used by the parser to recognize grammatical structure. Optionally, the programmer can add fields specific to the token class to carry application-specific information. To facilitate error reporting, it is common practice to include a text string associated with each token and possibly the line number.
A figure in the PostScript version depicts the hierarchy of token base classes available to the programmer. Because an ANTLR parser must be able to make local copies of input tokens, the name of the programmer's token definition is fixed to be ANTLRToken; we do this for efficiency reasons, as specifying a size to the parser and then asking it to allocate memory on the heap is much slower than defining a local object on the stack.

The most basic token, ANTLRTokenBase, knows how to set and get its token type and to get its text representation; the latter is needed to satisfy an ANTLR-generated parser's urge to print meaningful syntax error messages (because the text of a token is not required, this function returns the empty string unless redefined in a subclass). ANTLRToken must have a blank constructor:

    ANTLRToken() {;}

so that the parser can allocate local copies if required.

When using DLG-based scanners, additional behavior is required of a token. DLG must be able to set the text and token type for an ANTLRToken. A virtual function called

    virtual void makeToken(TokenType t, ANTLRChar *s);

is used because constructors cannot be virtual in C++ and we cannot pass the class to DLG so that it knows how to create and initialize an ANTLRToken. It is invoked in the following manner:

    tokentofill->makeToken(gettok(), lextext);

where tokentofill is a pointer to an ANTLRToken. This token, again, cannot be created by DLG because its size is unknown to the DLG support code. It must be created by the programmer (normally just a local variable) and its address passed to the DLG-based scanner via member function setToken().

The programmer defines an ANTLRToken class by deriving a class from DLGBasedToken (if DLG is being used; ANTLRTokenBase otherwise).
For example, a common token is provided with PCCTS:

    class ANTLRCommonToken : public DLGBasedToken {
    protected:
        ANTLRChar text[ANTLRCommonTokenTEXTSIZE+1];
    public:
        ANTLRCommonToken(TokenType t, ANTLRChar *s) : DLGBasedToken(t,s) { setText(s); }
        ANTLRCommonToken() {;}
        ANTLRChar *getText() { return text; }
        void setText(ANTLRChar *s) { strncpy(text, s, ANTLRCommonTokenTEXTSIZE); }
    };

Because the parser is attached to a DLG-based scanner, the parser has access to the current input line number; consequently, the line number is not stored in this token definition.

In C output mode, a type called Attrib is defined by the programmer and a required macro zzcr_attr() is invoked to set the attributes. In C++, these items correspond to ANTLRToken and makeToken(), respectively. There is no software stack of attributes or ANTLRTokens in C++ output mode; local token copies are local variables (on the hardware stack) for efficiency and ease of debugging.

ANTLR Token Streams

In C++ mode, an ANTLR-generated parser consumes ANTLRTokens from an ANTLRTokenStream, which is an object that acts like a pipeline between the parser and lexical analyzer; the lexical analyzer, in turn, consumes characters from a text-based input stream. An ANTLRTokenStream supplies a stream of tokens by returning the ``next'' token for each call to nextToken() and also knows how to return the text and line number of the current token. An ANTLRTokenStream can be anything from a simple array of TokenTypes to a full DLG-based scanner. A figure in the PostScript version depicts the token stream hierarchy.

To attach a non-DLG-based scanner to an ANTLR-generated parser, the programmer subclasses ANTLRTokenStream. For example,

    class MyTokenStream : public ANTLRTokenStream {
    private:
        char c;
    public:
        MyTokenStream() { c = getchar(); }
        ANTLRTokenBase *nextToken();
    };

where MyTokenStream::nextToken() is some function that embodies a lexical analyzer; i.e., it breaks up the input stream into vocabulary symbols.
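The simplest token stream mentioned above, one backed by an array of token types, might look like the following stand-alone sketch. It does not include the PCCTS headers: TokenBase, TokenStream, and ArrayTokenStream are hypothetical stand-ins for ANTLRTokenBase, ANTLRTokenStream, and a programmer-written subclass, and the token types are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum TokenType { Eof = 1, A, B };   // hypothetical token types

// Stand-in for the most basic token: it knows only its type.
struct TokenBase {
    TokenType type;
};

// Stand-in for the abstract token stream the parser reads from.
class TokenStream {
public:
    virtual ~TokenStream() = default;
    virtual TokenBase* nextToken() = 0;
};

// A token stream backed by a fixed sequence of token types.
class ArrayTokenStream : public TokenStream {
public:
    explicit ArrayTokenStream(std::vector<TokenType> types)
        : types_(std::move(types)) {}

    // Fill in the single internal token and return its address;
    // once the sequence is exhausted, keep returning Eof.
    TokenBase* nextToken() override {
        current_.type = pos_ < types_.size() ? types_[pos_++] : Eof;
        return &current_;
    }

private:
    std::vector<TokenType> types_;
    std::size_t pos_ = 0;
    TokenBase current_{Eof};
};
```

Note that this sketch returns the same address on every call, so a caller that wants to keep an earlier token must copy it before asking for the next one.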
nextToken() returns a pointer to an ANTLRToken that it has initialized.

ANTLR Parsers

All ANTLR-generated parsers in C++ mode are derived classes of ANTLRParser. No preprocessor symbols are used to define the structure of the parser as is done in C mode. In this way, the ANTLR parser support code can be compiled separately and linked with different parsers. The support code senses object variables such as demandlook and canuseinflook. A figure in the PostScript version depicts the parser class hierarchy.

A few of the public interface functions are worth mentioning. To access the token type of lookahead, use

    inline TokenType LA(int i);

To access a pointer to the ANTLRToken of lookahead, use

    inline ANTLRTokenBase *LT(int i);

When using a non-DLG-based scanner, the user must inform the parser what token type should be considered end-of-input. This token type will then be used by the error recovery facilities to scan past bogus tokens without going beyond the end of input.

    void setEofToken(TokenType t);

The programmer commonly wishes to modify the standard error reporting facility. To do so in C++ mode, simply subclass Parser, where Parser is your grammar class, and redefine syn():

    void syn(ANTLRTokenBase *tok, ANTLRChar *egroup, SetWordType *eset, TokenType etok, int k);

When syn() is redefined, so should edecode():

    void edecode(SetWordType *);

Upon catastrophic error, the following function is called (and should never return). The programmer can subclass Parser and redefine ANTLRPanic:

    void ANTLRPanic(ANTLRChar *msg);

The following functions are the analog of the C mode infinite lookahead macros:
    int infLAvalid(int i);
    int infLA(int i);
    inline int infline(int i);

The following protected member functions can also be redefined:

    virtual void tracein(ANTLRChar *r);
    virtual void traceout(ANTLRChar *r);

The various ANTLR parser class definitions could have benefited greatly from templates and multiple inheritance, but no current C++ compiler implements these reliably; i.e., because of features like this, C++ is totally nonportable due to compiler limitations.

AST Classes

To use ASTs with ANTLR in C++ mode, the user simply derives a class from either ASTBase or ASTDoublyLinkedBase and adds the desired fields. The AST classes automatically know how to do a preorder traversal of an AST; the programmer should redefine

    virtual void preorder_action() {;}
    virtual void preorder_before_action() { printf(" ("); }
    virtual void preorder_after_action() { printf(" )"); }

in their derived class if required. A figure in the PostScript version provides a simple example of how to use ASTs.

The zzmk_ast() and zzcr_ast() functions are now embodied by the AST class constructor. #-variables are used as before and are pointers to the node(s) created by token and rule references.

DLG Classes

DLG generates a class (default name is DLGLexer) that is derived from DLGLexerBase. DLG-based scanners read input from derived classes of DLGInputStream, the most common being DLGFileInput. A number of functions can be redefined in derived classes; the interesting ones are:

    virtual void erraction();
    void DLGPanic(DLGChar *msg);
    virtual ANTLRTokenBase *nextToken();

The nextToken() function can be redefined in a subclass so that it knows what an ANTLRToken looks like. The standard nextToken() does not have this luxury and, hence, the user is required to provide DLG-based scanners with an ANTLRToken to fill in.

The function trackColumns() can be called to turn on column tracking. This is analogous to setting the preprocessor symbol ZZCOL in C mode.
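Returning to the preorder hooks described under AST Classes above, the following stand-alone sketch shows how a traversal with before, action, and after hooks can produce a parenthesized listing of a tree. It is a hypothetical illustration, not the PCCTS ASTBase code: the class Node, its hook names, and the choice to append to a string instead of calling printf() are all assumptions made so the example is self-contained. It does assume the child-sibling layout (a first-child link and a next-sibling link).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for an AST base class: a child-sibling tree in which
// down_ is the first child and right_ is the next sibling.
class Node {
public:
    explicit Node(std::string label) : label_(std::move(label)) {}
    virtual ~Node() = default;

    // Append a child at the end of this node's child list.
    void addChild(Node* child) {
        if (down_ == nullptr) { down_ = child; return; }
        Node* last = down_;
        while (last->right_ != nullptr) last = last->right_;
        last->right_ = child;
    }

    // Walk this node and its right siblings in preorder, writing into out.
    void preorder(std::string& out) {
        for (Node* t = this; t != nullptr; t = t->right_) {
            if (t->down_ != nullptr) t->beforeAction(out);
            t->action(out);
            if (t->down_ != nullptr) {
                t->down_->preorder(out);
                t->afterAction(out);
            }
        }
    }

protected:
    // Hooks a derived class may redefine.
    virtual void action(std::string& out) { out += " " + label_; }
    virtual void beforeAction(std::string& out) { out += " ("; }
    virtual void afterAction(std::string& out) { out += " )"; }

private:
    std::string label_;
    Node* down_ = nullptr;
    Node* right_ = nullptr;
};
```

The before and after hooks fire only for nodes that have children, so leaves are listed bare while each subtree is wrapped in parentheses.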
ANTLR Parsers and Hand-Built Scanners

Because of the clean separation of parsing subtasks followed by ANTLR, it is a trivial matter to link in a hand-built scanner or any non-DLG-based scanner. Simply turn on the -gx ANTLR command line option, which turns off the generation of DLG input, and attach an instance of your scanner to an instance of the parser class generated by ANTLR. Three figures in the PostScript version present a complete example. The call to setEofToken() is done to inform ANTLR what token type is considered end of input; this is necessary for all hand-built scanners so that ANTLR does not try to resynchronize, after a syntax error, beyond the end of input.

New Supplied Files

    antlrx.h        All ANTLR parser support classes and the ANTLRParser class itself.
    antlrx.C        ANTLR parser support code.
    dlgx.h          DLG scanner support classes and the DLGLexerBase class.
    dlgx.C          DLG scanner support code.
    astx.h          AST class definition.
    astx.C          AST support code.
    DLexer.C        Support code that must be aware of the particular scanner generated by DLG. This is an ugly mechanism and will change in future versions.
    AToken.h        Definitions for classes ANTLRTokenBase, DLGBasedToken, and ANTLRCommonToken.
    ATokenStream.h  Definition of class ANTLRTokenStream.

$-Variables in C++ Mode

Because attributes do not exist in C++ mode, $-variables are exclusively pointers to ANTLRTokens. Further, $-variables do not exist for rule references; rule arguments and return values should be used instead. We anticipate the removal of $-variables altogether in future releases in favor of labels for rule elements, as in the tree-parser generator SORCERER.

Semantic Predicates

Semantic predicates should reference LT()->getText() instead of LATEXT(), as LATEXT does not exist in C++ mode.
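As an illustration of a predicate that tests the text of the lookahead token through a pointer, in the style of LT(1)->getText(), consider the stand-alone sketch below. Everything in it is hypothetical: Token, MiniParser, isTypeName, and classify are invented for this example, and the LT() shown here is a simplified stand-in for the member function of the same name in the parser class.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token with a text accessor, mirroring the getText() call.
struct Token {
    std::string text;
    const std::string& getText() const { return text; }
};

// Illustrative parser fragment (not generated by ANTLR).
class MiniParser {
public:
    MiniParser(std::vector<Token> tokens, std::set<std::string> typeNames)
        : tokens_(std::move(tokens)), typeNames_(std::move(typeNames)) {}

    // Pointer to the i-th token of lookahead (i >= 1); an empty token past the end.
    const Token* LT(std::size_t i) const {
        assert(i >= 1);
        std::size_t index = next_ + i - 1;
        return index < tokens_.size() ? &tokens_[index] : &eof_;
    }

    // Semantic predicate: is the text of the next token a known type name?
    bool isTypeName() const { return typeNames_.count(LT(1)->getText()) > 0; }

    // Choose between two alternatives that both begin with an identifier.
    std::string classify() const {
        return isTypeName() ? "declaration" : "expression";
    }

private:
    std::vector<Token> tokens_;
    std::set<std::string> typeNames_;
    std::size_t next_ = 0;
    Token eof_{""};
};
```

Both alternatives start with the same token type, so only the predicate on the token text can tell them apart; that is the situation in which a predicate of this kind is useful.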
Converting a 1.10 Grammar to 1.20 C++ Grammar

This section describes the procedure we used to convert the PCCTS 1.10 C grammar from C to C++ mode (this covers most, but not all, issues); ``your mileage may vary.''

  1. Remove the #parser directive, if any.
  2. typedef ANTLRToken to something or derive a class.
  3. Remove #include "charbuf.h" or other previous Attrib definition.
  4. Add a grammar class ``wrapper'' around your grammar rules.
  5. Convert AST stuff: (1) make a class AST definition; (2) add the fields in the AST_FIELDS macro to the class AST definition as fields; (3) remove zzcr_ast and make that the constructor; (4) remove zzmk_ast if you have it and make it another constructor (no need to convert any #[args] references in the grammar).
  6. Add a #include "DLGLexer.h" in the parser (or whatever you call the lexical analyzer class).
  7. Convert all [...] to [...] (both operands of this step were lost in the text conversion; see the PostScript version).
  8. Convert all LATEXT() to LT()->getText().
  9. Definitions in #header can come afterwards (i.e., do not use the directive anymore).
  10. Convert all $0= assignments (and the other $-variable result assignments, whose spelling was lost in the text conversion) to a return value or ``by-reference'' argument.
  11. Convert the ANTLR() macro reference into the series of object definitions outlined in this document.

Semantic Predicate Hoisting

The hoisting of predicates in version 1.10 had a number of bugs that have been fixed. In addition, version 1.20 has changed the semantics of semantic predicate hoisting slightly to gain a useful feature. We begin by describing the bug fix. The following grammar now behaves as advertised:

    a : <<p1>>? b
      | ID
      ;

    b : <<p2>>? ID
      | <<p3>>? ID
      ;

It results in the following code for a:

    void a(void)
    {
        ...
        if ( (LA(1)==ID) && ((p1) && ((p2) || (p3))) ) {
            if (!(p1)) {zzfailed_pred((ANTLRChar *)"  p1");}  //unused
            b();
        }
        else if ( (LA(1)==ID) ) {
            zzmatch(ID); zzCONSUME;
        }
        else ...
    }

Note that the generated test reflects the correct semantic validity of production one of a: ``p1 and (p2 or p3).'' In version 1.10, the ``or'' was an ``and.''

In 1.10, we indicated that predicates were hoisted ONLY if the grammar was syntactically ambiguous.
This had the unfortunate effect of making it impossible to include a predicate in the loop decision for (..)+ and (..)* subrules if only one alternative was present. We have changed the meaning of semantic predicates slightly so that if only one alternative exists in a looping subrule, all visible predicates are ALWAYS hoisted. For example,

    a : <<int i=5;>>  // match exactly 5 A's
        ( <<i>0>>? A <<i--;>> )+
      ;

In 1.10, this would have resulted in a loop that only tested the lookahead. In 1.20, the following is generated for the (..)+ loop:

    if (((i>0))) {
        do {
            if (!(i>0)) {zzfailed_pred((ANTLRChar *)"  i>0");}  // unused
            zzLOOP(zztasp2);
        } while ( (LA(1)==A) && ((i>0)) );
    }

C AST Changes

In C mode, the programmer can define the preprocessor symbol USER_DEFINED_AST, which allows the programmer to define the AST type themselves. The macro AST_REQUIRED_FIELDS is the minimum set of AST fields needed by ANTLR; as such, it functions like inheritance in C++. For example,

    typedef struct _ast {
        AST_REQUIRED_FIELDS;   // order is unimportant
        my stuff;
    } AST;

The structure name must be _ast for AST_REQUIRED_FIELDS to work. If the macro is not used, the structure name can be anything.

New Command-Line Options

The following ANTLR command line options are new:

    -CC
        Generate C++ output.
    -ct
        Do not make copies of tokens passed to the parser in C++ mode (default=to copy). When using DLG in conjunction with ANTLR, you will always want ANTLR to make copies because DLG only has space for one ANTLRToken (which is passed to the scanner with setToken); this address is always returned and, hence, without copies, all $-variables would point to the same ANTLRToken.
    -o
        Directory where output files should go (default="."). This is very nice for keeping the source directory clear of ANTLR and DLG spawn.

The -ai option has been removed in anticipation of an ANTLR graphical interface.

The following DLG command line options are new:

    -CC
        Generate C++ output.
    -o
        Directory where output files should go (default=".").
        This is very nice for keeping the source directory clear of ANTLR and DLG spawn.
    -cl <class>
        Specify a class name for DLG to generate. The default is DLGLexer. <class> will be a subclass of DLGLexerBase.

Also note that in C++ mode, DLG no longer accepts the output file name on the command line. The class name specified (or the default of DLGLexer) is used to derive the output file name:

    dlg [arguments lost in the text conversion]

Miscellaneous New Additions

The following minor changes were made:

  * A new character type is used for ANTLR and DLG in C++ mode: ANTLRChar and DLGChar, respectively. Normally these are 'char', but you can change the typedef to whatever you wish (you can even make it a class).
  * ASTs were incorrectly handled in conjunction with 1.10 syntactic predicates; this has been fixed.
  * The return values of rules were still assigned in guess mode (when using syntactic predicates); the arguments are still evaluated. This is perhaps not too bright.
  * Warnings about a missing #header now require that the -w2 ANTLR option be set.
  * #parser is now allowed to come first (before #header).
  * Fixed a bug so that ``(A B)+ A C'' will now terminate the loop upon ``A C'', whereas before it would just loop forever (it was not using enough lookahead).
  * When a token label (for which there is no regular expression) is referenced in a rule, a warning is generated (if the ANTLR command line option -w2 is specified).
  * Fixed a nasty bug that caused ANTLR to loop forever (and a day) upon very large grammars with lots of optional subrules.
  * ANTLR itself tends to give better error messages now; e.g., lexical errors now give the file, and grammatical errors (employing #errclasses) are more readable.

Future

Our work on ANTLR continues to be heavily influenced by the feedback from our industrial and academic user community. As such, we are currently developing or planning the following improvements and tools.

Good error recovery and reporting is notoriously difficult to achieve with parser generators, especially [...]-based tools (the class of tool named here was lost in the text conversion).
We are developing a sophisticated error handling mechanism, analogous to C++ exception handling and called parser exception handling, that approaches the flexibility of hand-built parsers.

The recognition strength of hand-built parsers arises from the fact that arbitrarily complex expressions can be used to distinguish between alternative productions. We will introduce a new type of predicate called a prediction predicate that constitutes the entire prediction expression for a particular production; i.e., ANTLR does not generate code to test lookahead for the associated production. We anticipate the notation: ``<< this-is-the-entire-prediction-expression>>?!''.

A graphical user interface is planned and a coder has been tentatively ``pressed'' into service. This ``GUI'' will display syntax diagrams on the screen and, hence, ambiguities in the grammar can be highlighted. The output of the GUI will be an ANTLR grammar or a PostScript representation of the syntax diagram.

We intend to yank the infinite lookahead mechanism out of ANTLRParser and put it in ANTLRTokenStream where it belongs.

[The users of PCCTS should be forewarned that we anticipate a break with total backward compatibility for a future release (perhaps PCCTS 2.00). This release is intended to fix the odious C output generated by the current version of ANTLR/DLG and will result in a modified grammar meta-language plus the removal of some parsing modes. Any book on PCCTS to be written will describe this version of reality. Also remember that the C++ output is going to change as we learn more about it.]

Acknowledgements

Thanks are due to Sumana Srinivasan, Mike Monegan, and Steve Naroff of NeXT, Inc. for their extensive help in the initial definition of the ANTLR C++ parser. They are also instrumental in the ongoing design of parser exception handling. We thank Gary Funck at Intrepid for his extensive testing of ANTLR and DLG plus his constant stream of excellent suggestions.
Steve Robenalt at Rockwell is single-handedly pushing the comp.compilers.pccts news group through, is helping with the workshop, and is porting PCCTS to a number of different platforms. We thank Tom Moog (moog@polhode.com) for his fantastic NOTES.newbie information. Ariel Tamches (tamches@cs.wisc.edu) deserves credit for spending a week of his Christmas vacation in the wilds of Minnesota helping me with the C++ output; he developed the majority of the code for the hand-built scanner C++ example. The C++ output was also influenced by Thom Wood (twood@tcis3.tcis.com) and Randy Helzerman (helz@ecn.purdue.edu). Anthony Green at Visible Decisions, John Hall at Worcester Polytechnic Institute, Devin Hooker at Ellery Systems, Kenneth D. Weinert at Information Handling Services, and Roy Levow at Florida Atlantic University helped beta test 1.20. Sriram Sankar at Sun Microsystems has helped debug a number of features, including the infinite lookahead line number tracking, and has provided a fix to make DLG char-size independent, which we hope to include soon. John Hall (jhall@ivy.wpi.edu) ported PCCTS to Visual C++. We would also like to thank the multitude of other users of PCCTS for their excellent suggestions and beta-testing of the new C++ parsers.